PREFACE



This monograph describes the work performed by R & D 
Consultants Company under Contract #OEC-0-9-I40548- 
2791(095) with the Office of Education of the Department
of Health, Education, and Welfare. The contract
is titled "A Computer-aided Study of Access Management 
and Collection Management in Libraries"; its principal 
objectives are the development of a model for information 
access and storage systems, and the study of the structure
of existing access systems with the intent of augmenting 
them in significantly useful ways by means of automated 
processing of machine readable data bases. 

The specification of such a model naturally requires 
considerable mathematical and statistical detail that 
makes for dry reading at best. We have therefore pre- 
pared a rather extensive introduction that summarizes 
the findings with only a minimum of documentation and 
then provided the necessary backup in the following 
chapters. In addition to the material contained herein, 
the contract called for a study of computer programming 
languages as they apply to problems in the library. At 
the invitation of the Editor of the Journal of Documentation 
and with the permission of the contract officer, this
study was published in the June 1971 issue of that
journal under the title

PROGRESS IN DOCUMENTATION: Programming Languages
in Mechanized Documentation.

Throughout the course of this study we have been indebted 
to Mr. Lawrence S. Papier of the Office of Education who
has provided many helpful suggestions both with regard to
the plan of our research and the problems of documenting
the results.

The authors also wish to express their appreciation to 
Richard O'Keefe and other members of the library staff 
of the Fondren Library, Rice University for their generous 
help and cooperation in the selection of the Fondren Index 
Sample which provides the central data base of this study. 

We are also indebted to Richard De Gennaro and Foster Palmer 
of the Harvard University Library for making available information
about the contents of the Widener Shelflist which
enabled us to determine the dynamic structure of their
classification system and also for the five year summary 
of their circulation statistics; to the late Gerald Mitchell 
of the Institute for Defense Analysis who aided us in the 
preparation of the distribution of digraphs; to the Conference
Board of the Mathematical Sciences, and particularly
the NISIMS Committee, who supported those aspects of this
work particularly concerned with accessing mathematical
archives; to John W. Tukey, the Statistical Research
Techniques Group of Princeton University and the National 
Science Foundation who supported the work on algorithmic 
indexing and made available for this study preliminary 
output from their permuted title listings of the retrospective
file of statistical papers; and to M. L. Puri,
Department of Mathematics, Indiana University, for his 
thoughtful contributions to the study of the mathematical 
models of access systems. 

Finally, we should like to acknowledge the contributions
of the staff of R & D Consultants Company: William E.
Houchin, particularly for his work on the information
theoretic aspects of the problem; Val Forsyth for her
invaluable contributions to the overall data handling
problems; and Joan Resnikoff and Rena Wells for their
painstaking efforts in analysing in fine detail the index
structure of the Fondren Index Sample.
LIBRARY ACCESS SYSTEMS



INTRODUCTION 

For some years many observers on the information science scene 
have been commenting on the "information explosion" and the 
effect this has on the librarian and on the library user. The
fundamental assertion is quite simple: libraries grow exponentially.
It is easy to show that this phenomenon has persisted at
least since Gutenberg. As long as the base is very small,
exponential growth can be coped with. Sooner or later, however,
repeated doubling of even a very small base every 20 or 30 years 
will lead to a very large base. When it becomes clear that an 
information base is already so large as to strain our ability to 
control and direct it, doubling its present size within another 
20 or 30 years can only be viewed with considerable concern. 

Whether library collections have reached such a stage at the 
present time is debatable. Some arguments have been put forth in 
recent years to the effect that a universal collection is now 
obsolete; that even the largest public and university libraries 
will soon have to move towards specialization of their collections 
and increased dependence on each other to achieve comprehensive 
coverage. Nevertheless, many large libraries continue to grow at 
their accustomed rate and new libraries of increased capacity con- 
tinue to be built. 

We raise the question here not with the hopes of resolving it, but 
rather to emphasize a self-evident point: the size of a library
collection is of fundamental importance. This should not be construed
as implying that the quality of a collection is unimportant,
but to stress that there are a number of basic problems concerning 
libraries that depend almost totally on questions of size rather 
than quality, however quality may be measured.

That collection size is important to a user may be simply illus- 
trated by a rather mundane set of examples. Consider, for instance, 
a collection of two or three dozen books on a desk. Clearly, the 
arrangement of such a collection is of no importance whatsoever.
The scanning speed of the normal human eye and the recognition
mechanism of the brain are so fast that one can locate a desired
book even while the arm mechanism is reaching forward to retrieve 
it. However, if we consider the 800-1,000 books that one can 
comfortably store on shelves on one wall of a modest size office, 
some degree of organization becomes necessary. In such a collec- 
tion, physical size normally plays an important role as grouping 
of books by size makes for more efficient use of space. But size 
is also an important visual key for locating a book. Equally,
color is useful both for aesthetic considerations and location 
keys. Size and color are generally not incompatible with a rough 
subject grouping, particularly if the collection consists of one 
or more series of publications. 
At this level it may also be useful to note that there is a kind
of Parkinsonian law in operation: the size of a collection
expands rapidly to fill the available shelf space. Thus a gap
of open shelf in such a collection is more likely to indicate 
the absence of a book rather than deliberate planning for future 
expansion, thereby providing an elemental circulation system 
including a simple mechanism for determining the spot for 
returning the book after use. Collection growth is normally 
handled by temporary storage on desk and table space until there 
is sufficient incentive to add a new shelf or set of shelves at 
which time "the new books" are not infrequently all shelved to- 
gether, thus providing another retrieval key for the personal 
access system: time of acquisition. 

Unprofessional though it may appear in terms of professional 
librarianship, such a system is both effective and cost-effective.
It is designed for the use of only a few people — perhaps only one — 
and it is presumed that these users are intimately familiar with 
the system. Periodic rejuggling of the storage positions is not 
only costly, but detrimental: it breaks down the simple access
system that is quite capable of remembering that the needed document
is that "medium sized blue book with the red stripe on the
fourth shelf near the door." No catalog system in the world can 
beat that kind of retrieval speed. 

When we move up to the 25-30,000 books normally found in a small 
public or college library, the access system must become more 
formal if for no other reason than the fact that there will be 
many more users, including a host of infrequent ones who must 
operate with reasonably simple instructions. At this level, only 
a handful of users (and certainly not all of the library staff) 
will be intimately familiar with the entire collection. This is 
not to say that personal knowledge of the collection is unimportant 
or that individual variations in the ways in which the books are 
shelved do not exist. Every experienced library user knows that 
the fastest way to determine if a book is in the collection, and 
if so where it is to be found, is to ask the librarian. In fact 
this is so well known that all librarians develop subtle and not 
so subtle techniques for fending off such requests both to 
preserve their sanity and to give themselves some time to attend 
to their other duties. 

However, a librarian who is never willing to guide a user to a 
book does not recognize the nature of the system. All modest 
size libraries vary from standard cataloguing practice in certain 
ways if only to keep cataloguing costs in line. Standardization 
is mainly useful to the user who moves about from one library to 
another over a period of time and does not wish to invest the time 
necessary to acquaint himself with the vagaries of a particular 
library the first time he has need of its contents. For such a
user, personal direction is of great value. 
The progression on to larger and larger collections creates more
and more complex access problems. A large university library is
so complex that no one librarian is able to personally familiarize 
himself with all of it. Instead, a staff of reference librarians 
is maintained, each covering different specialties. The complex 
access system is now so large that the user may need to consult 
the librarian to find an item in the access system where in a 
smaller library he could expect the same effort to provide him 
with the book itself. 

In brief, every library user quickly learns that size is a barrier 
to access and that his best strategy in trying to locate a book 
is to head for the smallest collection that is likely to contain 
or point to that book. According to this standard, a sophisticated 
user is one who can exercise good judgement in this regard. The 
primary rule presumably is that the older (and/or rarer) the book, 
the more likely it is that one will have to go to a large collec- 
tion. Other properties of the document such as language, place 
of imprint, subject matter, etc. clearly enter into the exercise 
of this judgement. It is curious that libraries do not, in 
general, provide detailed information of this kind about their 
holdings so that users can exercise this judgement more efficiently. 
Precise counts from the card catalog would, of course, be costly 
to obtain and many librarians might be loath to publish their
opinions about the approximate breakdown of their holdings by 
language, place of publication, etc., but these factors may not 
outweigh the utility of such descriptive information. 

Deriving counts from a machine readable catalogue is quite simple 
and relatively inexpensive so it is to be hoped that as more 
libraries shift to machine cataloguing they will follow Harvard's 
lead in publishing refined descriptions of their holdings by 
language, date of imprint, subject, etc. 
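
As a minimal sketch of how simple such counts become once the catalogue is machine readable, the following tallies holdings by language and by decade of imprint. The records, the field names (`language`, `year`), and the helper functions are hypothetical and chosen only for illustration; they do not describe any actual catalogue format.

```python
from collections import Counter

# Hypothetical catalogue records; field names are assumptions for illustration.
records = [
    {"title": "Oliver Twist", "language": "English", "year": 1838},
    {"title": "Les Miserables", "language": "French", "year": 1862},
    {"title": "Great Expectations", "language": "English", "year": 1861},
    {"title": "Faust", "language": "German", "year": 1808},
]


def holdings_by(records, field):
    """Count records by the value of one descriptive field."""
    return Counter(record[field] for record in records)


def holdings_by_decade(records):
    """Count records by decade of imprint (1861 and 1862 both fall in 1860)."""
    return Counter(10 * (record["year"] // 10) for record in records)


by_language = holdings_by(records, "language")
by_decade = holdings_by_decade(records)
```

One pass over the file yields each breakdown, which is why the marginal cost of publishing such descriptions is small once the records have been keyed.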

Up to this point we have constrained our discussion to the 
problem of finding a book within a set of books. Until recently, 
few would question that this was a fundamental problem in 
librarianship, if not the fundamental problem. Today some
authors would prefer to view all requests placed at libraries 
as requests for information, many of which could be best served 
by supplying the information itself rather than by directing the 
user to a document containing the information. In the frame in 
which we view the problem this is equivalent to requiring a much 
larger access system than most libraries could currently afford. 
However, even assuming an increase in funds for libraries and/or 
a decrease in costs for access systems, it is still not clear 
that requests for books will disappear. If one wants to read 
Oliver Twist presumably nothing else will do and classifying such 
a request as an "information request" in the interests of 
obtaining a unified theory does little to change the problem.
The user still wants the book. Nor is this phenomenon restricted
to fiction. Even such simple requests as "what is the current
population of the United States" or "what is the most recent
estimate of the speed of light" are frequently phrased in a con- 
text that demands not only a proper definition of the source, 
but also ancillary information about the methods used to obtain 
the estimate and the author's own views on the strengths and 
weaknesses of his procedures. In short, the user will frequently 
require access to the document containing the information and, 
in many cases, access to the supporting documents cited. 

Nevertheless, there are many proper user requests that are 
shaped as requests for information rather than for specific 
documents and it is well to consider the effect of collection size 
in such a situation. Here again it is clear that a law of parsi- 
mony is in operation. In short, one does not approach the Library 
of Congress to determine the size of a badminton court. Or at 
least one should not. Not only librarians but many other infor- 
mation sources are continually plagued by questions that could be 
more efficiently answered by reference to the nearest desk-sized
dictionary or one volume encyclopedia. It takes a patient member 
of a library staff to handle such requests in a manner that is 
likely to advance the user another step in user sophistication. 

The education of users is clearly a critical aspect in the effec- 
tiveness of any information system. 

Even on the basis of elementary arguments it seems reasonable to 
conclude that size is indeed a critical factor in the evaluation 
of collections of books and documents. The larger the collection 
the more likely it will be that the needed information will exist 
in it, and the more difficult it will be to find it. It is only a 
short step from such an observation to the hypothesis that the 
size of the access system will also be of importance, and will 
also be most effectively used by resort to a law of parsimony. 
Indeed, it is essential to recognize that the access system it- 
self is typically a collection of pieces of information, not just 
a set of pointers to an information collection. 

It is customary to think of a catalogue card as a container for 
a collection of information about a book, including information 
about its location. However, it is more than this. It also 
contains a subset of the information in the book and, no matter how 
small, this is in fact information. And it may just be the 
information the user needs. Titles contain information. Contents 
notes contain information. It is not unusual to find information 
about the author which is not contained in the book itself. 

Further, the collection of catalogue cards provides information 
that few if any of its books are likely to contain. To the 
extent that the statistics are available, any description of a 
library serves also as a description of the community it serves, 
biased to be sure by the collective decisions of the acquisitions 
staff over a period of years, but still descriptive of qualities 
that are very difficult to study from any other source. Where the 
collection in a particular field is large enough to be considered 
representative, or even definitive, statistics on the holdings can 
be most useful to an author making decisions on what to include
in an introductory or expository work. 

When one moves on to a study of other access devices such as 
indexes, abstracts, special bibliographies, permuted title 
lists, citation indexes, or cumulative lists of tables of con- 
tents, it is even easier to argue that each such device plays 
both roles: a container of information, and a pointer to other
information. But for that matter, the book itself plays both
roles: it not only contains information; it points to other containers
of information through footnotes, citations, appended
bibliographies, and remarks in the text itself. Thus a book is
an access device as well as an information container. 

What then is the fundamental difference between the book and the 
catalogue card, or the index and the table of contents? Both contain
information; both point to information. The user, again
operating under a principle of parsimony, goes to the smallest 
container, or set of containers, that is likely either to contain 
the information or point to a small set that does contain the 
information. The entire system operates under the fundamental 
assumption that the user is only willing to scan a certain amount 
of material to find what he wants. Both the author of the book 
and the author of the catalogue, in somewhat different ways,
attempt to break down the sum total of knowledge into bite-sized 
pieces and organize those pieces in various orderings so that the 
user can thread his way through the maze to the bite that he 
needs.

Both questions of order and questions of size are of fundamental 
importance in any formal inquiry into the structure of information
systems. A formal investigation into why certain orderings
of information are useful and why others are not (or are
of marginal utility) would require a much deeper understanding 
of the structure of information than is presently available. 
Ordering a library catalogue by author is presumably useful 
because almost all libraries do it. But trying to decide whether 
librarians do this because users remember authors or whether 
users remember authors because they know that authors are a 
useful access point in most catalogues is not likely, at least 
for the present, to bring us much closer to a proper under- 
standing of how such systems work. We are therefore forced to 
take the view that a new ordering is by definition useful if 
some segment of the community is willing to pay for its initial 
production and maintenance. 

It is in this context where we can see more clearly than in any 
other the potential impact of the use of computers in libraries. 
The great cost in the use of computers in this area is the cost 
of initial programming and the cost of data base acquisition. 
Marginal costs of producing new orderings of a data base are 
relatively small compared with the cost of obtaining the first 
ordering. The more the data base is "exploded," the smaller
the unit cost of material produced. Several examples should help
to put the problem in perspective:

Permuted Titles. The title is keyed once, with associated
information about the author, source, etc., and then
exploded by a factor of 5 to 6 to produce access to
each significant word in the title. 

MARC. The data is keyed once and then exploded by tape 
copying for use in many libraries and commercial firms, 
some of which explode the records again, e.g. for pro- 
ducing the several copies necessary for maintenance of 
their card catalogue. 

Widener Shelf List. The shelf list is keyed once, and then
exploded at the first level by generation of the shelf 
list itself together with alphabetical and chronological 
listings of the same entries. A second level explosion 
occurs through listing through the machine (first by line 
printer, more recently by computer typesetting) and the 
printing of copies through normal book production. 

MEDLARS. The material is keyed once for production of
Index Medicus and then exploded through on-line and batch
processing of information retrieval requests. 
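
The first of these examples, the permuted title list, can be sketched in a few lines. This is an illustrative sketch only: the stop list, the rotation format with a "/" marker, and the function names are assumptions, not a description of any particular production system. The sample title is the one cited in the Preface.

```python
# Assumed stop list of non-significant words; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def permuted_entries(title, source):
    """Return one (keyword, rotated title, source) entry per significant word."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        key = word.strip(".,;:").lower()
        if key and key not in STOP_WORDS:
            # Rotate the title so the keyword leads; "/" marks the original start.
            rotated = " ".join(words[i:] + ["/"] + words[:i])
            entries.append((key, rotated, source))
    return entries


def permuted_index(titles):
    """Explode every keyed title and sort the entries by keyword."""
    index = []
    for title, source in titles:
        index.extend(permuted_entries(title, source))
    return sorted(index)


titles = [("Programming Languages in Mechanized Documentation",
           "Journal of Documentation, June 1971")]
index = permuted_index(titles)
```

Here a single keyed title yields four index entries, one for each significant word; the keying cost is paid once and every additional entry is produced mechanically.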

Other examples involving the production of book catalogues for 
county library systems (where in many cases the explosion extends 
to copies for local schools), citation indexing, and various 
forms of union lists are now in fairly wide use. 

The Widener Library Shelf List is of particular interest because 
it provides an important example of the interplay between size 
and ordering. The chronological listing represents a new 
ordering, at least for a collection of this size, and it will 
be of interest to assess its utility after a period of time.
We shall later make use of this feature to study the dynamics
of the classification system. The alphabetic listing is not 
new; indeed such listings go back to antiquity. Further, 
special listings for subcollections are probably nearly as old. 
However, the systematic listing by alphabet for each main category
of the classification system for a system of this size is
only possible with machine help. Provision of this information
in addition to the alphabetic listing in the public
catalogue makes it possible for the user who has reason to 
believe that the material he is searching for is in, say, the 
American History class, to go directly to a much smaller collection
for his search, with the attendant time savings. In other
words, the machine not only provides the possibility of experimentation
with new orderings, it also permits the user to exercise
his access judgement by choosing among traditional listings of
various sizes.



Such reorganization of information is not limited to computer 
dependent schemes as is evidenced by the recent popularity of 
the undergraduate library concept in schools with large main 
library holdings. 

It is the purpose of this study to try to provide fresh insight 
into the nature of library problems by systematically studying 
the question of size in various information contexts. In this 
introduction we have tried to illustrate the role that size plays 
from the user's point of view. In the sections that follow we
shall study the card catalogue, the classification system, and
various other access mechanisms. We will determine their size
characteristics and show the impact of these considerations on 
the creation and use of access mechanisms. Finally, we shall 
devote the several chapters that follow to the more extensive 
statistical and mathematical justification necessary to provide 
a solid base for future study, improvement, and design of infor- 
mation access systems. 
THE CARD CATALOGUE

The primary library access device is the card catalogue. In 
simplest terms, the catalogue is a set of linear files: the 
shelf list, the subject heading file, the author or author-title 
file, the new accessions list, etc. Let us consider the problem 
of finding a particular entry of known form in one of these 
linear files. Clearly, if the file is of any length, it will 
be ordered by some filing rules (with which we will assume the 
user is familiar) and a superstructure of guides will be imposed 
to permit the user to move rapidly to the general area in which 
the required item is to be found. 

The natural superstructure for a linear file is hierarchal; in 
this case taking the form of cabinets which contain drawers 
which in turn are partitioned into sets of cards further separated 
by file guides. The user first scans the cabinet labels to 
locate the right cabinet, then he scans the drawer labels to 
locate the proper drawer, then he scans the file guides to locate 
the correct subset of cards, and finally he scans the card head- 
ings individually to find the desired card. It is perhaps worth 
noting that many libraries neglect to provide cabinet labels 
that can be scanned in the first step, thus requiring the user 
to scan the relatively small drawer labels in order to locate 
the proper cabinet. 
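
The step-by-step scan described above can be sketched as follows. This is a simplified model under stated assumptions: the file is a sorted list of headings, every label (cabinet, drawer, or guide) is the first heading of the group it covers, and each group holds `k` entries. The names `build_levels`, `find`, and `k` are hypothetical.

```python
def build_levels(cards, k):
    """Build the label hierarchy over a sorted list of card headings.

    levels[0] is the cards themselves; each higher level holds the first
    heading of every group of k entries in the level below it.
    """
    levels = [list(cards)]
    while len(levels[-1]) > k:
        below = levels[-1]
        levels.append([below[i] for i in range(0, len(below), k)])
    return levels


def find(levels, k, target):
    """Scan labels from the top level down.

    Returns (position of target in the card file or None, labels scanned).
    """
    scanned = 0
    start = 0  # index in the current level where scanning begins
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        if not level:
            return None, scanned
        chosen = start
        for i in range(start, min(start + k, len(level))):
            scanned += 1
            if level[i] <= target:
                chosen = i      # target lies at or beyond this label
            else:
                break           # passed the target; stop scanning this level
        if depth == 0:
            return (chosen if level[chosen] == target else None), scanned
        start = chosen * k      # first entry of the chosen group one level down
    return None, scanned


cards = list(range(0, 512, 2))           # 256 sorted headings
levels = build_levels(cards, 4)          # 256 cards, then 64, 16, and 4 labels
position, scanned = find(levels, 4, 200)
```

In this toy file the user never scans more than `k` labels at any level, so the total effort is bounded by `k` times the number of levels rather than by the length of the file.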



Several authors have studied the problem of determining optimal 
strategies for establishing the proper number of file guides, 
the proper size of a card drawer, etc. (See, for example,
Shoffner (1) and Lipetz and Song (2).) In the simplest case, if it
be assumed that scanning speed (and hence, cost) is the same at 
every level of the access structure, it is easy to show that the 
optimal strategy is to design each level of the structure in 
such a way that it decomposes the next level into a set of file 
segments of equal size, say K segments, where K is independent 
of the level. In terms of the card catalogue, this would imply 
that we should have K cabinets, each consisting of K drawers, 
each containing K file guides, each of which serves as a separator
for precisely K catalogue cards. See Chapter III for the
technical details. 

Determination of the proper value of K is not so easy. If we 
totally neglect the cost of providing and maintaining the access 
structure and choose that value of K which minimizes the 
searcher's costs we find that K should be equal to the natural 
constant of the calculus, e = 2.718... . As catalogue cards
are integral units, we are forced to choose K as an integer
value, either 2 or 3. The choice K=2 corresponds to a binary
search, a procedure that is widely used in file searching in computers.
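
One way to see where the constant e comes from is the following sketch. It assumes, as a simplification not spelled out in the text above, that the searcher scans on the order of K labels at each of N levels at a constant cost c per label, and that the file holds S = K^N cards. The symbols c, N, and C(K) are introduced here for illustration only.

```latex
\begin{align*}
S &= K^{N} \quad\Longrightarrow\quad N = \frac{\ln S}{\ln K}, \\
C(K) &= c\,K\,N = c\,\ln S\,\frac{K}{\ln K}, \\
\frac{dC}{dK} &= c\,\ln S\,\frac{\ln K - 1}{(\ln K)^{2}} = 0
  \quad\Longrightarrow\quad \ln K = 1, \quad K = e \approx 2.718 .
\end{align*}
```

Under this assumed model the factor K / ln K is about 2.885 at K = 2 and about 2.731 at K = 3, against 2.718 at K = e, so both integer choices lie close to the minimum.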



However, it is not reasonable to neglect the cost of providing 
and maintaining the access system. The smaller the value of K, 
the greater the cost of the access system. Formally, if S is
the size of the linear file, then the size of the optimal level 
structured access system, A, is related to K and S by the 
simple formula 



$$A = \frac{S - 1}{K - 1}.$$
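
The formula can be checked by direct counting. The sketch below assumes a file of exactly S = K^N cards and counts one access entry for every group at every level above the cards, including a single entry for the file as a whole; the function names are hypothetical.

```python
def access_system_size(k, levels):
    """Count the access entries above a file of k**levels cards.

    The levels above the cards hold k**(levels - 1), ..., k, and 1 entries;
    the cards themselves are not counted.
    """
    return sum(k ** i for i in range(levels))


def access_system_formula(k, levels):
    """Evaluate A = (S - 1) / (K - 1) with S = K**levels."""
    s = k ** levels
    return (s - 1) // (k - 1)


# Compare the direct count with the closed form for several structures.
checks = {(k, n): (access_system_size(k, n), access_system_formula(k, n))
          for k in (2, 3, 30) for n in (1, 2, 3, 4)}
```

The agreement is simply the sum of a geometric series, which also shows why the access system grows quickly as K becomes small.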



This leads to a typical problem in optimization: as K
decreases towards the natural constant e, scanning time (and
hence cost) decreases, but access system costs increase, slowly 
at first and then rather rapidly. Thus there is presumably an 
optimal value of K that minimizes the total system cost, i.e. 
the sum of the user costs and the access system costs. In 
theory, these costs could be measured and an optimal value for 
K thereby determined. However, it is not easy to obtain such 
cost data — particularly those associated with user scan time — so 
we choose instead to adopt a standard procedure from the field 
of operations research where such questions occur routinely: we
shall assume that current practice is constrained by economic
restraints to be close to optimal and determine the value of K 
currently in use. 
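
The shape of this trade-off can be illustrated with a small sketch. Everything numerical here is hypothetical: the text notes that real cost data are hard to obtain, so the unit costs below are placeholders, and the user-cost model (about K labels scanned at each of log_K S levels) is an assumption made for illustration. Only the access system size A = (S - 1) / (K - 1) is taken from the text.

```python
import math


def user_cost(k, s, cost_per_label=1.0):
    """Assumed scanning cost: about k labels at each of log_k(s) levels."""
    return cost_per_label * k * math.log(s) / math.log(k)


def access_cost(k, s, cost_per_entry=1.0):
    """Cost of the access structure, proportional to A = (S - 1) / (K - 1)."""
    return cost_per_entry * (s - 1) / (k - 1)


def best_k(s, cost_per_label, cost_per_entry, candidates=range(2, 201)):
    """Return the integer K among the candidates that minimises total cost."""
    return min(
        candidates,
        key=lambda k: user_cost(k, s, cost_per_label)
        + access_cost(k, s, cost_per_entry),
    )


S = 757_354                                   # file size used in the text
k_free_access = best_k(S, 1.0, 0.0)           # access system costs nothing
k_cheap_access = best_k(S, 1.0, 0.001)        # hypothetical unit cost
k_dearer_access = best_k(S, 1.0, 0.01)        # hypothetical unit cost
```

With a free access system the best integer K sits next to e; as the hypothetical cost per access entry rises, the best K moves upward, which is the qualitative behaviour the argument relies on.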



From elementary considerations, it is evident that the value of 
K currently in use is in the neighborhood of 30. File cabinet 
construction varies, but the most popular size is the 4 by 8 
cabinet containing 32 drawers. However, as cabinets are 
normally placed side by side, the distinction of "cabinet" is
largely lost from the visual point of view (perhaps explaining 
why so many libraries fail to provide large label designations 
for each cabinet). A more significant measure can be found by
determining the average number of cards per drawer. For the 
Fondren Library at Rice University this was found to be 826 
(see (3)). The drawer is, by itself, a two-level file consisting
of cards and file guides. The earlier derivation for an
N-level file reduces to the following result for a two-level
file: the number of file guides should be equal to the square
root of the number of cards. This is the main conclusion of
the Lipetz and Song study (2). Now the square root of 826 is
28.74, again a number in the vicinity of 30.

Now consider a typical university library. In such a library 
the shelf list or other simple linear catalogue file* is a full
four-level access system: cards, file guides, drawers, and
cabinets. The mean size of a university library, computed from
those listed in (4), is 757,354 volumes. As this is a four- 
level system, K is equal to the fourth root of 757,354 or 
K = 29.50, again a value very close to 30. 
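
The two estimates of K quoted above follow from taking the appropriate root of the file size. A minimal check, with the hypothetical helper name `implied_k`:

```python
def implied_k(file_size, levels):
    """Return the branching factor K for which K**levels equals file_size."""
    return file_size ** (1.0 / levels)


drawer_k = implied_k(826, 2)        # cards per drawer at Fondren, two-level file
library_k = implied_k(757_354, 4)   # mean university library, four-level file

print(f"{drawer_k:.2f}")   # prints 28.74
print(f"{library_k:.2f}")  # prints 29.50
```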



* Intermixed subject-title-author files naturally multiply the
total number of cards by nearly 3 and modify the details but
not the essential result of this argument.




